Abstract. In a knowledge discovery process, interpretation and evaluation of the mined results are indispensable
in practice. In the case of data clustering, however, it is often difficult to see in what aspect each cluster
has been formed. This paper proposes a method for automatic and objective characterization or "verbalization"
of the clusters obtained by mixture models, in which we collect conjunctions of propositions (attribute-value
pairs) that help us interpret or evaluate the clusters. The proposed method provides us with a new, in-depth and
consistent tool for cluster interpretation/evaluation, and works for various types of datasets including
continuous attributes and missing values. Experimental results with a couple of standard datasets exhibit the
utility of the proposed method, and the importance of the feedback from the interpretation/evaluation step.

1 Introduction 

In a knowledge discovery process, interpretation and evaluation of the mined results are indispensable in
practice. In the case of data clustering, however, it is often difficult to see in what aspect each cluster has
been formed, only from a list of the instances in the cluster. Visualization is a natural way of understanding
things, and particularly in text clustering, Hotho et al. applied formal concept analysis with Hasse diagrams to
visualize the similarity and dissimilarity among the obtained clusters [2]. On the other hand, since there would
generally be a physical limitation or a high implementation cost in visualization, we would rather like to
"verbalize" the clusters, i.e. we associate an intuitive descriptive label (or a set of such labels) with each
cluster. Additionally, it seems desirable that the labels are chosen objectively and automatically from the
clusters. So far, there have been only a few labeling methods, e.g. LabelSOM [3], Mei et al.'s automatic labeling
for topic models [4] and others [5, 6]. CLIQUE also has a similar motivation to ours in that it performs
hyper-rectangular clustering and at the same time produces comprehensible descriptions of the obtained clusters.

In this paper, we propose a new labeling method that associates conjunctions of propositions (attribute-value
pairs), called propositional labels, with the clusters obtained by mixture models. For example, consider a cluster
C which contains several creatures such as dolphins, mink, platypus and seals. Then, letting "milk" and "aquatic"
be boolean attributes of the creatures, (milk=True ∧ aquatic=True) would be a suitable propositional label
for the cluster C, if none of the creatures in the other clusters has these properties together. From this label
we easily find that C is a cluster of aquatic mammals. To find these propositional labels objectively and
automatically, we conduct an Apriori-style breadth-first search for minimal propositional labels that discriminate
the cluster of interest from the others. Due to these features, as we will see later, the proposed method can
provide us with a new, in-depth and consistent tool for cluster interpretation/evaluation. It is also notable that,
unlike the previous attempts, the proposed method is fully applicable to various types of datasets including
continuous attributes and missing values. Another novel contribution of this paper is to show empirically the
importance of the feedback from the interpretation/evaluation step in achieving a reasonable clustering result.

The rest of this paper is structured as follows. In Section 2 we describe the details of the proposed method.
Section 3 then reports the experimental results with a couple of standard datasets. Finally, we mention the related
work in Section 4 and conclude the paper in Section 5.



2 Proposed method 
2.1 Preliminaries 

Before starting, let us introduce some terminology and notation. Suppose that we have a dataset $D$ of $N$
instances which are described by $m$ discrete attributes $A_1, A_2, \ldots, A_m$. Then, we simply refer to each
instance by $a = (a_1, a_2, \ldots, a_m)$, where $a_j$ is a value of the $j$-th attribute $A_j$ of the instance.
Also we write $\mathcal{V}(A_j)$ as the set of possible values of $A_j$ (i.e. $a_j \in \mathcal{V}(A_j)$,
$1 \le j \le m$). We now introduce a propositional label (or a label, for short)
"$X_1 = x_1$" $\wedge$ "$X_2 = x_2$" $\wedge \cdots \wedge$ "$X_n = x_n$" such that
$\{X_1, X_2, \ldots, X_n\} \subseteq \{A_1, A_2, \ldots, A_m\}$, $X_i$ and $X_{i'}$ are distinct ($i \ne i'$),
and $x_i \in \mathcal{V}(X_i)$. In a probabilistic context,
$p(\text{"}X_1 = x_1\text{"} \wedge \cdots \wedge \text{"}X_n = x_n\text{"}) = p(X_1 = x_1, \ldots, X_n = x_n)$
holds. Also, $p(Z = z, \ldots)$ for a discrete random variable $Z$ and its value $z$ is generally abbreviated as
$p(z, \ldots)$ if the context is clear.

Furthermore, we add some notational conventions. First, without loss of generality, we assume that the attribute
values do not overlap among attributes (i.e. $\mathcal{V}(A_j) \cap \mathcal{V}(A_{j'}) = \emptyset$ for
$j \ne j'$). Then, a propositional label "$X_1 = x_1$" $\wedge \cdots \wedge$ "$X_n = x_n$" is unambiguously
simplified as $x = (x_1 \wedge \cdots \wedge x_n)$ or $x = (x_1, \ldots, x_n)$. Here we have $|x| = n$, where $|x|$
denotes the number of conjuncts in $x$, and is called the length of $x$. An instance $a = (a_1, \ldots, a_m)$
is also regarded as a propositional label "$A_1 = a_1$" $\wedge \cdots \wedge$ "$A_m = a_m$". In this paper, for
notational brevity, we use a conjunctive form and a vector form for propositional labels interchangeably depending
on the context. Besides, to simplify the algorithm descriptions presented later, in a propositional label
"$X_1 = x_1$" $\wedge \cdots \wedge$ "$X_n = x_n$", we will always enumerate $X_1, X_2, \ldots$ so that the order
of enumeration preserves the original one $A_1, A_2, \ldots$, i.e. for $j_1, j_2, \ldots, j_n$ such that $X_i$
corresponds to $A_{j_i}$ ($1 \le i \le n$, $1 \le j_i \le m$), $j_i < j_{i'}$ holds when $i < i'$.

Here consider a propositional label $x = (x_1, \ldots, x_n)$. Then, a label $x' = (x'_1, \ldots, x'_{n'})$ is
called a subconjunction of $x$ if $\{x'_1, \ldots, x'_{n'}\} \subseteq \{x_1, \ldots, x_n\}$, and we denote this by
$x' \subseteq x$. If $x' \subseteq x$ but $x' \ne x$, we write $x' \subset x$. For an instance $a$ and a
propositional label $x$, we say "$a$ satisfies $x$" if $x \subseteq a$. For a boolean attribute $A_j$, we may
abbreviate "$A_j$ = True" and "$A_j$ = False" as "$A_j$ = T" and "$A_j$ = F", respectively.
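The notation above can be made concrete with a small sketch. It assumes, purely for illustration, that a propositional label is stored as a frozen set of (attribute, value) pairs; the helper names `make_label`, `is_subconjunction` and `satisfies` are hypothetical and do not come from the paper.

```python
from typing import FrozenSet, Tuple

Conjunct = Tuple[str, object]      # one attribute-value pair "X = x"
Label = FrozenSet[Conjunct]        # a conjunction of such pairs

def make_label(**pairs) -> Label:
    """Build a label from keyword arguments, e.g. make_label(milk=True)."""
    return frozenset(pairs.items())

def is_subconjunction(x_sub: Label, x: Label, proper: bool = False) -> bool:
    """x_sub is a (proper) subconjunction of x if its conjuncts are a (proper) subset."""
    return x_sub < x if proper else x_sub <= x

def satisfies(instance: Label, x: Label) -> bool:
    """An instance, itself a full-length label, satisfies x if x is a subconjunction of it."""
    return x <= instance

dolphin = make_label(milk=True, aquatic=True, fins=True)
goat = make_label(milk=True, aquatic=False, fins=False)
label = make_label(milk=True, aquatic=True)

print(len(label))                   # the length |x| of the label
print(satisfies(dolphin, label))    # the dolphin satisfies the label
print(satisfies(goat, label))       # the goat does not
```

Using sets drops the fixed attribute ordering that the text imposes on the vector form; that ordering only matters for the candidate generation described later.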

2.2 Overview 

In this paper, we consider probabilistic clustering based on a simple mixture model called a naive Bayes model. A
naive Bayes model has a latent class variable $C$ taking on the identifiers $\{1, 2, \ldots, K\}$ of $K$ clusters,
and represents a simple joint distribution:
$p(C = k, A_1 = a_1, \ldots, A_m = a_m) = p(C = k) \prod_{j=1}^{m} p(A_j = a_j \mid C = k)$, or equivalently
$p(k, a) = p(k) \prod_j p(a_j \mid k)$. Here the probabilities $p(k)$ and $p(a_j \mid k)$ are treated as the model
parameters. Given a dataset $D$ of instances and the number $K$ of clusters, we do:

1. Estimate the parameters in a model $p(k, a)$ from $D$.

2. Assign the most probable class $k^*(a) = \arg\max_k p(k \mid a)$ to each instance $a$ based on the estimated
parameters. The $k$-th cluster $C_k$ is then formed as a set of instances $a$ such that $k^*(a) = k$.

3. Find propositional labels $x$ that characterize well each cluster $C_k$.

In the first two steps, we perform clustering, and the third step is called labeling. As is well-known, the first step
is realized by the EM (expectation-maximization) algorithm [8]. From the second step, clustering can be cast
as an unsupervised classification task, and we call $p(k \mid a)$ the (class) membership probability of an instance $a$.
For the last step, we have not yet specified which propositional labels characterize the clusters, or how to obtain
them. The next two sections, Sections 2.3 and 2.4, address these issues, respectively.
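The second step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the parameters have already been estimated and are strictly positive, that `prior[k]` holds $p(k)$ and `cond[k][attribute][value]` holds $p(a_j \mid k)$, and that clusters are indexed from 0. The toy parameter values are invented for the example.

```python
import math

def membership_probabilities(prior, cond, instance):
    """Return [p(k | a) for each cluster k] under a naive Bayes model.

    `instance` maps attribute names to observed values; attributes that are
    missing from the dict are simply marginalized out.
    """
    log_joint = []
    for k, p_k in enumerate(prior):
        lp = math.log(p_k)
        for attr, value in instance.items():
            lp += math.log(cond[k][attr][value])   # conditional independence
        log_joint.append(lp)
    top = max(log_joint)                           # subtract max for numerical stability
    weights = [math.exp(lp - top) for lp in log_joint]
    total = sum(weights)
    return [w / total for w in weights]

def assign_cluster(prior, cond, instance):
    """Return the index k*(a) of the most probable cluster."""
    post = membership_probabilities(prior, cond, instance)
    return max(range(len(post)), key=post.__getitem__)

# Hypothetical two-cluster model over two boolean attributes.
prior = [0.5, 0.5]
cond = [
    {"milk": {True: 0.9, False: 0.1}, "aquatic": {True: 0.2, False: 0.8}},
    {"milk": {True: 0.1, False: 0.9}, "aquatic": {True: 0.7, False: 0.3}},
]
a = {"milk": True, "aquatic": False}
print(membership_probabilities(prior, cond, a))
print(assign_cluster(prior, cond, a))
```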

2.3 Characteristic propositional labels 

Relevance scores: To choose suitable propositional labels $x = (x_1 \wedge \cdots \wedge x_n)$ or
$x = (x_1, \ldots, x_n)$ of a cluster $C_k$ objectively and automatically, we introduce a scoring function that
measures how relevant $x$ and $C_k$ are. Previously, several relevance scores have been proposed in various
statistical/data-mining tasks. The following are adaptations of those relevance scores to our labeling problem:

- Growth rate: $\mathrm{GR}_k(x) = p(x \mid k) / p(x \mid \neg k)$, where $\neg k$ indicates that the instance
under consideration belongs to a class other than $k$. This score is mainly used in emerging pattern mining [9] and
explicitly states that the instances satisfying $x$ are likely to occur in the cluster $C_k$ and unlikely to occur
in the clusters other than $C_k$. $\mathrm{GR}_k(x)$ ranges from 0 (when $p(x \mid k) = 0$) to $\infty$
(when $p(x \mid k) > 0$ and $p(x \mid \neg k) = 0$).

- Membership probabilities: $p(k \mid x)$. PRIM, a rule-based method for bump hunting, tries to find $x$ such that
$p(k \mid x) \ge r$, where $r$ is some threshold, under a separate-and-conquer strategy [10]. It is crucial to see
that, for a fixed $k$, $p(k \mid x) = p(k) p(x \mid k) / p(x) \propto p(x \mid k) / p(x)$ holds. In class
association rule (CAR) mining [11], $p(k \mid x)$ is called the confidence of a rule $x \Rightarrow C_k$.
- Pointwise mutual information: $\mathrm{PMI}_k(x) = \log p(k, x) - \log\{p(k) p(x)\}$. PMI has been used in text
analysis [12]. This score is rewritten as $\log p(x \mid k) - \log p(x)$, which is adopted by a well-known
probabilistic clustering tool AutoClass [13] for post-analysis (named "attribute influence values"), in a limited
case with $|x| = 1$. The non-logarithmic version $p(k, x) / (p(k) p(x))$ is called the lift of a class association
rule $x \Rightarrow C_k$ [14].

- Leverage: $\mathrm{Leverage}_k(x) = p(k, x) - p(k) p(x)$. This score is often used for finding interesting
association rules [15]. $\mathrm{Leverage}_k(x)$ is equivalent to the weighted relative accuracy (WRAcc), a score
used in subgroup discovery, and can be rewritten as $p(x)(p(k \mid x) - p(k))$ or
$p(k) p(\neg k)(p(x \mid k) - p(x \mid \neg k))$ [16]. A related score $|p(x \mid k) - p(x \mid \neg k)|$, often
called support difference, is used in contrast set mining [17].

- TF-IDF: $\text{TF-IDF}_k(x) = p(x \mid k) \log\{1 / p(x)\}$. This is a popular measure in information retrieval
[18], and is a product of term frequency (TF) and inverse document frequency (IDF). TF of a term $t$ in a document
$d$ is the relative frequency of $t$ occurring in $d$, and IDF of $t$ is the logarithm of the inverse of the
relative frequency with which a document containing $t$ occurs in the whole document set. Then, assuming that a
term occurs at most once in a document, the TF-IDF of a term $t$ in a document $d$ is given as
$p(t \mid d) \log\{1 / p(t)\}$. Since TF-IDF is known to give a reasonably high score to $t$ that characterizes $d$,
$\text{TF-IDF}_k(x)$ above can be used by analogy where $t$ corresponds to $x$, and $d$ corresponds to $k$.

- Precision/Recall: Precision and recall are also popular measures in information retrieval. In our context,
$p(k \mid x)$ and $p(x \mid k)$ can be regarded as precision and recall of label $x$ for the $k$-th cluster [14].
Also in COBWEB [19], a well-known conceptual clustering method, $p(k \mid x)$ and $p(x \mid k)$ are respectively
used as metrics for inter-class dissimilarity and intra-class similarity. To balance the opposite behavior of
precision and recall, in information retrieval, we often use their harmonic mean
$2 p(k \mid x) p(x \mid k) / (p(k \mid x) + p(x \mid k))$ and call it the F-score. Lamirel et al. proposed the use
of the F-score for automatic labeling of clustering results [6]. Similarly, the product of precision and recall
$p(k \mid x) p(x \mid k)$, which substantially works as the geometric mean of $p(k \mid x)$ and $p(x \mid k)$, is
used by Popescul and Ungar [5].

Other relevance scores are discussed in comprehensive surveys by Kralj Novak et al. [16] and by Geng et
al. [14]. It is easy to show that $p(k \mid x_1) \le p(k \mid x_2)$ iff
$\mathrm{GR}_k(x_1) \le \mathrm{GR}_k(x_2)$,² and $p(k \mid x_1) \le p(k \mid x_2)$ iff
$\mathrm{PMI}_k(x_1) \le \mathrm{PMI}_k(x_2)$. Consequently, for a particular cluster $C_k$, the first three scores
give the same ranking over the propositional labels. Hereafter we call $p(x \mid k)$ the local support, and $p(x)$
the global support. The relevance scores above commonly rely on the local support with a penalty regarding the
global support. This contrastive use of the global support and the local support is also found in the category
utility adopted in COBWEB [19].
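All of the scores listed above are functions of the class prior $p(k)$ and the local supports $p(x \mid c)$ of a fixed label $x$ in every cluster $c$. The following sketch computes them side by side; the function name `relevance_scores` and the toy numbers are assumptions made for illustration, and the code presumes at least two clusters and $p(x) > 0$.

```python
import math

def relevance_scores(prior, local_support, k):
    """Scores of a fixed label x for cluster k.

    prior[c] = p(c); local_support[c] = p(x | c).
    """
    p_k, p_x_k = prior[k], local_support[k]
    p_x = sum(pc * px for pc, px in zip(prior, local_support))      # global support
    p_x_not_k = sum(pc * px for c, (pc, px) in enumerate(zip(prior, local_support))
                    if c != k) / (1.0 - p_k)                        # p(x | not k)
    precision = p_k * p_x_k / p_x                                   # p(k | x)
    recall = p_x_k                                                  # p(x | k)
    return {
        "growth_rate": math.inf if p_x_not_k == 0 else p_x_k / p_x_not_k,
        "membership": precision,
        "pmi": -math.inf if p_x_k == 0 else math.log(p_x_k) - math.log(p_x),
        "leverage": p_k * p_x_k - p_k * p_x,
        "tf_idf": p_x_k * math.log(1.0 / p_x),
        "f_score": 0.0 if p_x_k == 0 else 2 * precision * recall / (precision + recall),
    }

prior = [0.4, 0.6]
score_a = relevance_scores(prior, [0.5, 0.1], k=0)   # label concentrated in cluster 0
score_b = relevance_scores(prior, [0.3, 0.3], k=0)   # label independent of the cluster
for name in ("growth_rate", "membership", "pmi"):
    print(name, score_a[name], score_b[name])
```

In this toy case the growth rate, the membership probability and the PMI all rank the first label above the second, in line with the equivalence stated in the text.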

In this paper, we choose $p(k \mid x)$ as the relevance score for two reasons related to intuitiveness for the end
users. First, we can of course interpret $p(k \mid x)$ as discriminative probabilities, by which we classify an
instance satisfying $x$. As mentioned in Section 2.2, clustering is performed based on the membership probabilities
$p(k \mid a)$, which are a special case of $p(k \mid x)$. The second reason is more practical: $p(k \mid x)$ is
inherently normalized (i.e. $0 \le p(k \mid x) \le 1$). Thanks to this, we can use a threshold $r$, which just
ranges over $(0, 1]$ and is commonly applied to all clusters, to filter out $x$ such that $p(k \mid x) < r$.

Minimality: Let us consider two propositional labels $x_1$ and $x_2$ that fulfill some requirement (e.g.
$p(k \mid x_1) \ge r$ and $p(k \mid x_2) \ge r$ for some threshold $r$), and also suppose that
$x_1 \subseteq x_2$ holds. In such a case, we favor $x_1$ over $x_2$, because the longer one may have some
redundant information which hinders us from understanding the cluster. In other words, we would like to have only
minimal labels. In the literature on emerging pattern mining, such minimal patterns are called essential emerging
patterns [20], and Ji et al. proposed an efficient mining algorithm named ConSGapMiner for minimal distinguishing
sequences [21].

Model-based computation of relevance scores: We have introduced several relevance scores which are based
on probabilities. In most of the previous work, these probabilities are directly estimated from a given dataset $D$
of instances. For example, membership probabilities are estimated as
$p(k \mid x) = |\{a \in C_k \mid x \subseteq a\}| / |\{a \in D \mid x \subseteq a\}|$. In our method, on the other
hand, relevance scores are computed from the model parameters via the joint distribution (Section 2.2). This
model-based approach has a couple of advantages. First, as seen later, we can efficiently compute the scores,
exploiting the conditional independence in the model, without scanning the whole dataset $D$. In many cases, the
space for the model parameters is much smaller than the dataset. The second advantage is that the model parameters
are well-abstracted data as long as the model fits $D$, and there would be less chance of being affected by noise.
Finally, there is a positive side-effect that we need not care about missing values in $D$ since we only use the
parameters estimated by the EM algorithm.
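The contrast between the two estimation styles can be sketched as below. The count-based function follows the ratio given above, while the model-based one multiplies the naive Bayes parameters of the conjuncts only (attributes absent from the label marginalize out to 1). The data layout (`prior[c]`, `cond[c][attribute][value]`, 0-based cluster indices) and all numbers are assumptions for illustration.

```python
def empirical_membership(instances, assignments, label, k):
    """Count-based p(k | x): share of instances satisfying x that lie in cluster k.

    Assumes at least one instance satisfies the label.
    """
    covered = [c for a, c in zip(instances, assignments)
               if all(a.get(attr) == v for attr, v in label.items())]
    return covered.count(k) / len(covered)

def model_membership(prior, cond, label, k):
    """Model-based p(k | x) = p(k) p(x | k) / p(x), without touching the dataset."""
    local = []
    for c in range(len(prior)):
        p = 1.0
        for attr, v in label.items():
            p *= cond[c][attr][v]            # p(x | c) factorizes over the conjuncts
        local.append(p)
    p_x = sum(pc * px for pc, px in zip(prior, local))   # global support p(x)
    return prior[k] * local[k] / p_x

# Hypothetical data and parameters.
instances = [{"milk": True, "aquatic": True}, {"milk": True, "aquatic": False},
             {"milk": False, "aquatic": True}, {"milk": True, "aquatic": True}]
assignments = [0, 0, 1, 1]
prior = [0.5, 0.5]
cond = [
    {"milk": {True: 0.9, False: 0.1}, "aquatic": {True: 0.2, False: 0.8}},
    {"milk": {True: 0.1, False: 0.9}, "aquatic": {True: 0.7, False: 0.3}},
]
x = {"milk": True, "aquatic": True}
print(empirical_membership(instances, assignments, x, k=0))
print(model_membership(prior, cond, x, k=0))
```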

² $\mathrm{GR}_k(x) = (p(k)/p(\neg k))^{-1} (p(k \mid x)/p(\neg k \mid x)) \propto p(k \mid x)/(1 - p(k \mid x))$.



Selecting characteristic propositional labels: Now based on the discussions above, we define characteristic
propositional labels, which characterize well the obtained clusters. A propositional label $x$ of the cluster $C_k$
is characteristic iff:

1. $p(k \mid x) \ge r$,

2. $p(x) \ge s_{\mathrm{global}}$,

3. $p(x \mid k) \ge s_{\mathrm{local}}$, and

4. There is no $x' \subset x$ that satisfies conditions 1 to 3 above,

where $r$, $s_{\mathrm{global}}$ and $s_{\mathrm{local}}$ are user-specified thresholds, and the probabilities
$p(k \mid x)$, $p(x)$ and $p(x \mid k)$ are computed via the joint distribution. Conditions 1 to 4 are called the
relevance condition, the global support condition, the local support condition, and the minimality condition,
respectively.

While most of the existing CAR mining algorithms are guided by the threshold for $p(x \mid k)$, we
treat the first and the fourth conditions as the primary filters. The remaining conditions are introduced to remedy
the problem that we often obtain unintuitive characteristic labels with very low global/local support, and also to
reduce the burden of the exhaustive search for characteristic labels, which will be described in the next section.
So currently we do not consider putting a tight restriction on global/local support (e.g.
$s_{\mathrm{local}} = 1/(|D|/K) = K/|D|$, which implies that each of equally-sized clusters should contain at least
one instance).

2.4 Exhaustive search for characteristic propositional labels 

All possible propositional labels form a version space [22], and on this structure, we conduct an Apriori-style
breadth-first search for the entire set of characteristic labels for each cluster. There are two major styles for such
an exhaustive search: depth-first and breadth-first. We take a breadth-first style because, as seen later, it is easier
to check the minimality of characteristic labels in a breadth-first style,³ and because we do not necessarily need
very long characteristic labels that are difficult to read.

The Find procedure (Algorithm 1) is the main routine of the search algorithm for characteristic labels, which
calls the GenCandidate function (Algorithm 2). The basic flow is similar to Apriori (GenCandidate is our version
of the apriori-gen function in [23]), but is different in that we compute probabilities while generating
candidates. In addition, since this probability computation requires normalization for each membership probability
$p(k \mid x)$, most of the algorithm should work in parallel for all clusters. It is also crucial to note that the
global/local supports of $x$ ($p(x)$ and $p(x \mid k)$) are anti-monotonic w.r.t. the inclusion relation (i.e.
$p(x \mid k) \ge p(x' \mid k)$ if $x \subseteq x'$), but in general our relevance score is not. Instead, like
ConSGapMiner, we prune based on the minimality of characteristic labels.

In the Find procedure, for each $C_k$, $S_n[k]$ indicates a set of propositional labels of length $n$ that satisfy
the global/local support conditions, and $R_n[k]$ indicates a set of labels in $S_n[k]$ that additionally satisfy
the relevance condition. $R_n[k]$ are the characteristic labels of length $n$ which we wish to have, and we do not
extend the labels in $R_n[k]$. $W_n[k] = S_n[k] \setminus R_n[k]$ are therefore the labels to be worked on next.

The candidate labels of length $(n + 1)$ are generated by the GenCandidate function, in which the labels
of length $n$ in $W_n[k]$ are combined effectively. In Line 5 of GenCandidate, like the "prune" step of Apriori,
"SubConj($x_{\mathrm{ext}}$) $\subseteq W_n[k]$" filters out the over-generated candidate labels using
anti-monotonicity of global/local support and minimality at the same time. SubConj($x$) is a function that returns
the set of $x$'s subconjunctions of length $|x| - 1$,⁴ and using the property that
$W_n[k] = S_n[k] \setminus R_n[k]$, the filtering condition requires that each of the immediate subconjunctions of
$x_{\mathrm{ext}}$ should be in $S_n[k]$ (due to anti-monotonicity), but should not be in $R_n[k]$ (due to
minimality). This way of filtering, together with the breadth-first strategy, enables us to perform effective
pruning by only checking the labels in $W_n[k]$.⁵ Then, for each candidate label that has passed the filter, we
compute the probabilities $p(x \mid k)$, $p(x)$ and $p(k \mid x)$ (Lines 11–18). The point here is that we take the
union of the $C_{n+1}[k]$'s in advance (Line 11) to avoid redundant computation, and reuse the previously computed
values for $p(x_{\mathrm{prev}} \mid k)$, exploiting the conditional independence in the naive Bayes model.
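The interplay of Find and GenCandidate described above can be sketched in a few lines. This is an illustrative reading of the procedure, not the authors' code: labels are tuples of (attribute, value) pairs ordered by attribute name, `prior[k]` and `cond[k][attribute][value]` are assumed parameter containers with 0-based cluster indices, at least one support threshold is assumed positive, and the optional greedy pruning and the handling of continuous attributes are left out.

```python
from itertools import combinations

def gen_candidates(W_k):
    """Join labels of W_k sharing a prefix; keep an extension only if all of its
    immediate subconjunctions are in W_k (support and minimality pruning)."""
    candidates = set()
    for x in W_k:
        for y in W_k:
            if x[:-1] == y[:-1] and x[-1][0] < y[-1][0]:   # same prefix, later attribute
                ext = x + (y[-1],)
                if all(sub in W_k for sub in combinations(ext, len(ext) - 1)):
                    candidates.add(ext)
    return candidates

def find_characteristic_labels(prior, cond, r, s_global, s_local):
    K = len(prior)
    local = {}                                   # label -> [p(x | k) for each k]

    def p_x(x):                                  # global support
        return sum(prior[k] * local[x][k] for k in range(K))

    def split(labels, k):                        # returns (R_n[k], W_n[k])
        S = {x for x in labels if p_x(x) >= s_global and local[x][k] >= s_local}
        R = {x for x in S if prior[k] * local[x][k] / p_x(x) >= r}
        return R, S - R

    singles = {((attr, v),) for attr in cond[0] for v in cond[0][attr]}
    for x in singles:
        attr, v = x[0]
        local[x] = [cond[k][attr][v] for k in range(K)]
    found, W = [], []
    for k in range(K):
        R, W_k = split(singles, k)
        found.append(R)
        W.append(W_k)
    while any(W):
        C = [gen_candidates(W_k) for W_k in W]
        for x in set().union(*C):                # union first: one pass per candidate
            attr, v = x[-1]
            local[x] = [local[x[:-1]][k] * cond[k][attr][v] for k in range(K)]
        W = []
        for k in range(K):
            R, W_k = split(C[k], k)
            found[k] |= R
            W.append(W_k)
    return found

# Hypothetical model: cluster 0 is mostly milk-giving and aquatic, cluster 1 is mixed.
prior = [0.5, 0.5]
cond = [
    {"aquatic": {True: 0.9, False: 0.1}, "milk": {True: 0.9, False: 0.1}},
    {"aquatic": {True: 0.5, False: 0.5}, "milk": {True: 0.5, False: 0.5}},
]
labels = find_characteristic_labels(prior, cond, r=0.75, s_global=0.01, s_local=0.05)
print(labels[0])
print(labels[1])
```

In this toy model neither single conjunct reaches the threshold for cluster 0, so the search extends them and reports the length-2 conjunction, whereas cluster 1 is already characterized by single conjuncts that are never extended.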

To further speed up the search algorithm in the case with many attributes, in the Find procedure, we optionally
introduce a greedy pruning, similar to that of a commercial data-mining tool named Magnum Opus [15]. To be more

³ ConSGapMiner mentioned above works in a depth-first fashion, and needs to introduce an extra data structure (a
prefix tree) to reduce the time for the post-check on minimality.
⁴ More specifically, for $x = (x_1, \ldots, x_{n-1}, x_n)$, SubConj($x$) $= \{(x_2, x_3, \ldots, x_{n-1}, x_n),
(x_1, x_3, \ldots, x_{n-1}, x_n), \ldots, (x_1, x_2, \ldots, x_{n-2}, x_n), (x_1, x_2, \ldots, x_{n-2}, x_{n-1})\}$.

⁵ $W_n[k]$ is constructed from the labels in $W_{n-1}[k]$, and hence is guaranteed not to include any labels $x$
such that $x' \subseteq x$, $x' \in R_{n'}[k]$ and $1 \le n' < n$.



Algorithm 1 Find

 1: for all k = 1, 2, ..., K do
 2:   S_1[k] := { a_j | 1 ≤ j ≤ m, a_j ∈ V(A_j), p(a_j) ≥ s_global, p(a_j | k) ≥ s_local }
 3:   R_1[k] := { a_j ∈ S_1[k] | p(k | a_j) ≥ r }
 4:   W_1[k] := S_1[k] \ R_1[k]
 5: end for
 6:
 7: n := 1
 8: while ∃k : W_n[k] ≠ ∅ do
 9:   ⟨C_{n+1}[1], ..., C_{n+1}[K]⟩ := GenCandidate(W_n[1], ..., W_n[K])
10:   for all k = 1, 2, ..., K such that C_{n+1}[k] ≠ ∅ do
11:     S_{n+1}[k] := { x ∈ C_{n+1}[k] | p(x) ≥ s_global, p(x | k) ≥ s_local }
12:     R_{n+1}[k] := { x ∈ S_{n+1}[k] | p(k | x) ≥ r }
13:     W_{n+1}[k] := S_{n+1}[k] \ R_{n+1}[k]
14:   end for
15:   n := n + 1
16: end while
17:
18: return ⟨∪_n R_n[1], ..., ∪_n R_n[K]⟩



concrete, we delete $a_j$ such that $p(k \mid a_j) < p(k)$ from $S_1[k]$ after Line 2. In addition,
$x = (x_1, \ldots, x_n, x_{n+1}) \in S_{n+1}[k]$ such that $p(k \mid x) < p(k \mid x')$, where
$x' = (x_1, \ldots, x_n)$, are considered unpromising, and deleted from $S_{n+1}[k]$ after Line 11. This greedy
pruning is unsafe, i.e. we may miss some characteristic labels actually satisfying the conditions in Section 2.3,
but it would bring high efficiency in many practical cases.

2.5 Handling continuous attributes 

Until now, we have assumed that all attributes are discrete. To handle continuous attributes and discrete attributes
consistently in terms of membership probabilities, we also "propositionalize" each continuous attribute. To be
more specific, as is often done in mixture modeling, we consider that each continuous attribute follows a univariate
Gaussian distribution, in which two types of parameters, the mean $\mu_{j,k}$ and the variance $\sigma^2_{j,k}$,
are introduced for the $j$-th continuous attribute $A_j$ and the $k$-th cluster $C_k$. These parameters are also
estimated by the EM algorithm. We further assume that we are given a set $Q = \{q_1, q_2, \ldots, q_{|Q|}\}$ of
different probabilities, where $0 < q_h < 1$ for $1 \le h \le |Q|$, and the indices are given so that
$q_h < q_{h'}$ if $h < h'$. For instance, we may have $Q = \{0.1, 0.2, \ldots, 0.9\}$. Then, using a cumulative
distribution function $F_{j,k}$ with the mean $\mu_{j,k}$ and the variance $\sigma^2_{j,k}$ for each $A_j$ and
$C_k$, we introduce "$\alpha_h^{(j,k)} \le A_j \le \beta_h^{(j,k)}$" as a conjunct in a propositional label, where
$\alpha_h^{(j,k)} = \mu_{j,k} - \delta_h^{(j,k)}$ and $\beta_h^{(j,k)} = \mu_{j,k} + \delta_h^{(j,k)}$ such that
$F_{j,k}(\beta_h^{(j,k)}) - F_{j,k}(\alpha_h^{(j,k)}) = q_h$. It can be seen here that $\alpha_h^{(j,k)}$ and
$\beta_h^{(j,k)}$ are symmetric w.r.t. the mean $\mu_{j,k}$. Hereafter we omit the superscript $(j, k)$ unless it
is needed.

Now consider $x = (x_0 \wedge \text{"}\alpha_h \le X_n \le \beta_h\text{"})$ and
$x' = (x_0 \wedge \text{"}\alpha_{h'} \le X_n \le \beta_{h'}\text{"})$ where $h < h'$. Then, we define
$x' \subset x$ and we have $p(x) < p(x')$. With this new inclusion relation, in the search algorithm, an additional
minimality check is made for the last conjunct corresponding to a continuous attribute, just after $R_n[k]$ is
computed.⁶ One may see that the $\alpha_h$'s and $\beta_h$'s above are model-based quantile values,⁷ and choosing
an appropriate "$\alpha_h \le A_j \le \beta_h$" leads to an automatic adjustment of $(\alpha_h, \beta_h)$, which
resembles the 'peeling' operation in PRIM, a rule-based bump hunting method [10].
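The interval bounds follow directly from the Gaussian quantile function: since the interval is symmetric about the mean and must carry probability mass $q_h$, its upper bound is the $(1 + q_h)/2$ quantile. A minimal sketch using Python's standard `statistics.NormalDist` is shown below; the function name `central_interval` and the example parameters are assumptions for illustration, and `sigma` denotes the standard deviation (the square root of the variance).

```python
from statistics import NormalDist

def central_interval(mu, sigma, q):
    """Return (alpha, beta), symmetric about mu, with F(beta) - F(alpha) = q."""
    dist = NormalDist(mu, sigma)
    beta = dist.inv_cdf((1.0 + q) / 2.0)     # upper bound is the (1 + q)/2 quantile
    delta = beta - mu
    return mu - delta, mu + delta

def candidate_intervals(mu, sigma, Q):
    """One nested interval per probability in Q, from narrowest to widest."""
    return [central_interval(mu, sigma, q) for q in sorted(Q)]

# Hypothetical cluster-wise parameters of one continuous attribute.
mu, sigma = 1.5, 0.2
for q, (alpha, beta) in zip([0.2, 0.4, 0.6, 0.8],
                            candidate_intervals(mu, sigma, [0.2, 0.4, 0.6, 0.8])):
    print(f"q = {q}: {alpha:.3f} <= A_j <= {beta:.3f}")
```

With $q_h = 0.9$ the two bounds are the 5% and 95% quantiles, matching the footnote on model-based quantile values.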

3 Experiments 

In the experiments, we used four datasets: the zoo dataset, the iris dataset, the 20 newsgroups dataset and the flags
dataset.⁸ For the first three datasets, we gave the correct number $K$ of clusters to the clustering algorithm,
considering ideal situations. We then compared the obtained characteristic labels and the original (human-annotated)

⁶ To be specific, after Lines 3 and 12 in the Find procedure, non-minimal labels are deleted from both $R_n[k]$ and
$S_n[k]$.
⁷ For instance, if $q_h$ is given as 0.9, $\alpha_h$ and $\beta_h$ respectively correspond to the 5%-tile value and
the 95%-tile value under the Gaussian distribution.

⁸ The zoo dataset, the iris dataset and the flags dataset are available from the UCI ML Repository
(http://archive.ics.uci.edu/ml/), and the 20 newsgroups dataset is available from the UCI KDD Archive
(http://kdd.ics.uci.edu/).



Algorithm 2 GenCandidate(W_n[1], ..., W_n[K])

 1: for all k = 1, 2, ..., K do
 2:   C_{n+1}[k] := ∅
 3:   for all x = (x_1, ..., x_{n-1}, x_n) ∈ W_n[k] and x' = (x_1, ..., x_{n-1}, x'_n) ∈ W_n[k]
        such that ∀j : {x_n, x'_n} ⊄ V(A_j) do
 4:     x_ext := (x_1, ..., x_{n-1}, x_n, x'_n)
 5:     if SubConj(x_ext) ⊆ W_n[k] then
 6:       C_{n+1}[k] := C_{n+1}[k] ∪ {x_ext}
 7:     end if
 8:   end for
 9: end for
10:
11: D_{n+1} := ∪_{k=1}^{K} C_{n+1}[k]
12: for all x = (x_1, ..., x_{n-1}, x_n, x_{n+1}) ∈ D_{n+1} do
13:   x_prev := (x_1, ..., x_n)
14:   p(x | k) := p(x_prev | k) p(x_{n+1} | k) for k = 1, ..., K
15:   p(x) := Σ_{k=1}^{K} p(k) p(x | k)
16: end for
17:
18: p(k | x) := p(k) p(x | k) / p(x) for k = 1, ..., K and x ∈ C_{n+1}[k]
19:
20: return ⟨C_{n+1}[1], ..., C_{n+1}[K]⟩



classes. On the other hand, since the flags dataset does not contain the class information, we explore a plausible
number of clusters by characteristic labels together with a Bayesian score for model selection. For simplicity,
throughout the experiments, we set the threshold $s_{\mathrm{global}}$ for the global support $p(x)$ to a small
value ($1/|D|$), so that the influence of $s_{\mathrm{global}}$ is negligible. In addition, we tried 1,000
re-initializations in the EM algorithm to avoid getting trapped in unwanted local optima.

3.1 Zoo dataset 

The zoo dataset describes the classification of 101 species of creatures with 17 attributes. The species are originally
categorized into seven classes. Table 1 (top-left) shows the confusion matrix of the clustering result. We can see
from this matrix that the creatures in the class "mammals" are split into two clusters $C_1$ and $C_2$, whereas the
creatures in "reptiles" and "amphibians" are merged into cluster $C_5$. Besides, the remaining tables in Table 1 show
the obtained characteristic labels for $C_1$, $C_2$ and $C_3$, where $C_3$ corresponds to the original class "birds."
The labels in the tables are ordered firstly according to the length of $x$ (i.e. syntactic generality), secondly
according to the magnitude of $p(x \mid k)$ (i.e. statistical generality), and thirdly according to the magnitude of
$p(k \mid x)$.⁹ We used $r = 0.9$ and $s_{\mathrm{local}} = K/|D|$ as the thresholds for $p(k \mid x)$ and
$p(x \mid k)$, respectively, where $D$ is the dataset and $K = 7$ is the number of clusters.
Since the original classes are unknown in real situations, we interpret the clusters $C_1$, $C_2$ and $C_3$ only from
the obtained characteristic labels. For example, all creatures in $C_3$ have feathers, so we can guess that $C_3$
corresponds to birds. There are also several plausible labels for $C_3$ which support our guess. Interestingly, on
the other hand, the obtained labels indicate that the (wrongly) split clusters $C_1$ and $C_2$ correspond to
terrestrial and aquatic mammals, respectively. So one may conclude that these split clusters are still meaningful.
In the past, to evaluate the quality of the obtained clusters, there has been no way but to numerically check the
closeness between the obtained clusters and the human-annotated classes, using some matching criteria, such as
purity, normalized mutual information and the (adjusted) Rand index [18, 24]. In contrast, as seen above, the
characteristic labels provide us with a new and in-depth way for cluster evaluation. Similar interpretations are
possible for the other clusters, whose characteristic labels are shown in Table 2.
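For comparison with such numerical criteria, purity can be computed directly from a confusion matrix. The sketch below is only an illustration of the criterion, not a result reported in the paper: the counts are transcribed from the confusion matrix in Table 1 with empty cells read as zero, and the function name `purity` is an assumption.

```python
# Rows: original classes; columns: clusters C1..C7 (counts transcribed from Table 1).
confusion = {
    "mammals":    [35, 6,  0,  0, 0, 0, 0],
    "birds":      [0,  0, 20,  0, 0, 0, 0],
    "fishes":     [0,  0,  0, 13, 0, 0, 0],
    "amphibians": [0,  0,  0,  0, 4, 0, 0],
    "reptiles":   [0,  0,  0,  0, 5, 0, 0],
    "insects":    [0,  0,  0,  0, 0, 8, 0],
    "others":     [0,  0,  0,  0, 1, 2, 7],
}

def purity(matrix):
    """Fraction of instances that belong to the majority class of their cluster."""
    rows = list(matrix.values())
    total = sum(sum(row) for row in rows)
    majority = sum(max(column) for column in zip(*rows))   # best class per cluster
    return majority / total

print(f"purity = {purity(confusion):.3f}")
```

Such a single number does not reveal that the split of the mammals into $C_1$ and $C_2$ is itself interpretable, which is the point made by the characteristic labels.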

3.2 Iris dataset 

As a typical continuous dataset, we picked the iris dataset, in which there are four attributes: petal width, petal
length, sepal width and sepal length. Each of 150 cases in the dataset originally belongs to one of three classes:
Setosa, Versicolour and Virginica. The confusion matrix and the obtained labels are shown in Table 3. We used



⁹ We also observed that intuitive labels tend to be highly ranked according to the harmonic mean of $p(k \mid x)$ and $p(x \mid k)$.



Table 1. The confusion matrix, and the characteristic labels for the clusters C1, C2 and C3 in the zoo dataset.

                               clusters
original classes    C1   C2   C3   C4   C5   C6   C7
mammals             35    6    0    0    0    0    0
birds                0    0   20    0    0    0    0
fishes               0    0    0   13    0    0    0
amphibians           0    0    0    0    4    0    0
reptiles             0    0    0    0    5    0    0
insects              0    0    0    0    0    8    0
others               0    0    0    0    1    2    7

labels for C1                 p(k|x)   p(x|k)
milk=T ∧ aquatic=F             1.000    1.000
eggs=F ∧ aquatic=F             0.972    1.000
milk=T ∧ fins=F                0.945    1.000
hair=T ∧ toothed=T             0.913    1.000
hair=T ∧ eggs=F                0.913    1.000
eggs=F ∧ fins=F                0.905    1.000
hair=T ∧ tail=T                0.900    0.857
hair=T ∧ legs=4                0.956    0.828
milk=T ∧ legs=4                0.935    0.828
eggs=F ∧ legs=4                0.910    0.828

labels for C2                 p(k|x)   p(x|k)
milk=T ∧ aquatic=T             1.000    1.000
breathes=T ∧ fins=T            1.000    0.666
milk=T ∧ fins=T                1.000    0.666
hair=T ∧ aquatic=T             1.000    0.666
eggs=F ∧ fins=T                1.000    0.555
milk=T ∧ legs=0                1.000    0.500
hair=T ∧ fins=T                1.000    0.444
hair=F ∧ milk=T                1.000    0.333
fins=T ∧ legs=4                1.000    0.222

labels for C3                 p(k|x)   p(x|k)
feathers=T                     1.000    1.000
milk=F ∧ legs=2                1.000    1.000
toothed=F ∧ legs=2             0.991    1.000
eggs=T ∧ legs=2                0.991    1.000
hair=F ∧ legs=2                0.983    1.000
airborne=T ∧ legs=2            0.979    0.800
airborne=T ∧ tail=T            0.903    0.800
legs=2 ∧ catsize=F             0.900    0.700
airborne=T ∧ aquatic=T         1.000    0.240

Fig. 1. Scatter plots of the iris dataset (right) for sepal-length vs. sepal-width and (left) for petal-length vs.
petal-width.



the thresholds $r = 0.9$ and $s_{\mathrm{local}} = K/|D|$. A candidate set $Q$ of cumulative probabilities
(introduced in Section 2.5) is $\{0.2, 0.4, 0.6, 0.8\}$. The scatter plots in Fig. 1 tell us that the obtained
characteristic labels adaptively capture the dense part of cluster $C_1$. Also it should be noted that, in the
proposed method, the Euclidean distance from the center of the cluster is translated into a cumulative probability
under a Gaussian distribution.



3.3 20 newsgroups dataset 



The 20 newsgroups dataset is originally a collection of approximately 20,000 articles from 20 different
newsgroups. We used a preprocessed dataset available from http://people.csail.mit.edu/jrennie/20Newsgroup,
restricted to the articles from three newsgroups: comp.sys.ibm.pc.hardware, rec.sport.hockey and
soc.religion.christian. We performed further preprocessing: stemming by Porter's algorithm [18],
removing infrequent words (< 200 occurrences), removing short articles (< 10 words) and removing the attributes
taking only one value. The dataset was finally converted into 2,799 bag-of-words boolean vectors whose dimension
is 2,016. In labeling by the proposed method, we did not use the conjuncts of the form "w = False" (or "w = F")



Table 2. The obtained characteristic labels for clusters C4, C5, C6 and C7.

labels for C4                                   p(k|x)   p(x|k)
milk=F ∧ fin=T                                   1.000    1.000
breathes=F ∧ tail=T                              0.948    1.000
eggs=T ∧ fin=T                                   0.951    1.000
toothed=T ∧ breathes=F                           0.941    1.000
backbone=T ∧ breathes=F                          0.935    1.000
breathes=F ∧ fin=T                               1.000    1.000
hair=F ∧ fin=T                                   0.906    1.000
fin=T ∧ catsize=F                                1.000    0.692

labels for C5                                   p(k|x)   p(x|k)
venomous=T ∧ legs=4                              0.943    0.239
eggs=F ∧ milk=F                                  1.000    0.199
milk=F ∧ toothed=T ∧ fin=F                       1.000    0.799
hair=F ∧ toothed=T ∧ fin=F                       0.935    0.799
milk=F ∧ toothed=T ∧ breathes=T                  1.000    0.719
milk=F ∧ breathes=T ∧ legs=4                     1.000    0.539
feathers=F ∧ milk=F ∧ backbone=T ∧ fin=F         1.000    0.899
hair=F ∧ feathers=F ∧ backbone=T ∧ fin=F         0.931    0.899

labels for C6                                   p(k|x)   p(x|k)
backbone=F ∧ breathes=T                          0.916    1.000
predator=F ∧ backbone=F                          0.978    0.899
breathes=T ∧ legs=6                              1.000    0.800
aquatic=F ∧ legs=6                               0.965    0.800
predator=F ∧ legs=6                              1.000    0.720
airborne=T ∧ backbone=F                          1.000    0.600
feathers=F ∧ eggs=T ∧ airborne=T                 1.000    0.600
feathers=F ∧ airborne=T ∧ toothed=F              1.000    0.600

labels for C7                                   p(k|x)   p(x|k)
legs=5                                           1.000    0.142
backbone=F ∧ breathes=F                          0.985    1.000
toothed=F ∧ breathes=F                           0.972    1.000
breathes=F ∧ tail=F                              0.958    1.000
aquatic=T ∧ backbone=F                           0.922    0.857
breathes=F ∧ legs=6                              1.000    0.285
aquatic=T ∧ legs=6                               1.000    0.245
backbone=F ∧ legs=8                              0.908    0.142
backbone=F ∧ catsize=T                           0.908    0.142




which means the absence of word w in the article. The thresholds τ and σ_local were configured as 0.9
and 10 × K/|D|, respectively. Furthermore, we applied the greedy pruning described at the end of Section 2.4.
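The preprocessing and the threshold setting described above can be sketched as follows. This is only a minimal illustration, not the implementation used in the experiments: the function names are hypothetical, the input is assumed to be already tokenized and stemmed, "occurrences" is assumed to mean the total count over the collection, and short articles are assumed to be removed after the infrequent words.

```python
from collections import Counter

def to_boolean_vectors(docs, min_occurrences=200, min_words=10):
    """docs: list of token lists (assumed already stemmed)."""
    # Count total occurrences of each word over the whole collection.
    counts = Counter(tok for doc in docs for tok in doc)
    frequent = {w for w, c in counts.items() if c >= min_occurrences}
    # Drop infrequent words, then drop the articles that are too short.
    kept = [[t for t in doc if t in frequent] for doc in docs]
    kept = [doc for doc in kept if len(doc) >= min_words]
    # Boolean bag-of-words: only the presence or absence of a word matters.
    doc_sets = [set(doc) for doc in kept]
    vocab = sorted(set().union(*doc_sets)) if doc_sets else []
    # Remove attributes taking only one value. Every word in vocab occurs
    # in at least one article, so only "present everywhere" can happen.
    vocab = [w for w in vocab if any(w not in s for s in doc_sets)]
    vectors = [[w in s for w in vocab] for s in doc_sets]
    return vocab, vectors

def local_support_threshold(num_clusters, num_instances, factor=10):
    """Threshold of the form factor * K / |D| for p(x | k)."""
    return factor * num_clusters / num_instances
```

With K = 3 clusters and |D| = 2,799 articles, `local_support_threshold(3, 2799)` is about 0.0107, so a label must cover roughly 1% of a cluster to be kept.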

The results are shown in Table 4. From the obtained characteristic labels for C1, it is seen that the articles
containing words such as "hockei" ("hockey"; the suffix was presumably replaced by the stemmer) and "nhl"
("NHL"; the National Hockey League) are likely to belong to C1. There are also the names of a hockey team and its
home city (i.e. Pittsburgh Penguins). From this information we can infer that C1 is a cluster of articles related
to hockey. Similarly, it is easy to see that C2 is a cluster of articles related to computer hardware from words
such as "mb" ("megabytes" or "motherboard"), "disk" and "motherboard". C3 would be understood as a cluster
that contains the articles talking about religious matters. Although there are many attributes in this dataset, our
search algorithm is feasible thanks to the pruning based on minimality and the optimized setting described
above.



3.4 Flags dataset 

The flags dataset contains the details of 194 national flags, originally described by 30 attributes. In this experiment, 
we focused on the clusters of national flags grouped on their visual aspects, and hence non-visual attributes 
(landmass, zone, area, population, language and religion) were removed in advance. As is written above, since 

σ_local was configured as 10 × K/|D| because the 20 newsgroups dataset is 10 times (or more) larger than the zoo and the iris
datasets.

As shown in the confusion matrix in Table 4, C2 contains the articles from soc.religion.christian, but the characteristic
labels related to religion did not appear. This would be because the articles from soc.religion.christian
mainly use non-technical terms, which are less likely to form characteristic labels.

It took 404 seconds on a PC with a Core i7 2.66 GHz to get all characteristic labels for all clusters. Currently the search
algorithm is implemented in the Ruby scripting language.



Table 3. (top) The confusion matrix in clustering the iris dataset, and (bottom) the obtained labels.

original classes     C1    C2    C3
Setosa               50     0     0
Versicolour           0    45     5
Virginica             0     0    50

labels for C1                                   p(k|x)   p(x|k)
0.06 < petal-w < 0.43                           0.999    0.800
1.2 < petal-l < 1.7                             1.000    0.799
3.3 < sepal-w < 3.5                             0.953    0.199
4.3 < sepal-l < 5.6 ∧ 3.0 < sepal-w < 3.8       0.978    0.480
4.9 < sepal-l < 5.1 ∧ 2.8 < sepal-w < 4.0       0.926    0.159

labels for C2                                   p(k|x)   p(x|k)
4.0 < petal-l < 4.4                             0.931    0.799
5.5 < sepal-l < 6.4 ∧ 0.99 < petal-w < 1.6      0.979    0.479
5.8 < sepal-l < 6.0 ∧ 2.6 < sepal-w < 2.9       0.964    0.160

labels for C3                                   p(k|x)   p(x|k)
4.5 < petal-l < 6.4                             0.985    0.800
6.4 < sepal-l < 6.7                             0.904    0.800
1.7 < petal-w < 2.2                             0.942    0.400
2.8 < sepal-w < 3.1 ∧ 1.6 < petal-w < 2.4       0.918    0.480
2.8 < sepal-w < 4.0 ∧ 1.6 < petal-w < 2.4       0.947    0.446


the class information is not given in this dataset, we first estimated the number of clusters as K̂ by the Cheeseman-Stutz
score [13], a Bayesian model selection criterion adopted in AutoClass, and then, starting from K̂, we explored
a plausible number of clusters by observing the characteristic labels. Another point in this dataset is that discrete
attributes and continuous attributes are mixed: all eight integer attributes (e.g. the number of circles in
the flag) were treated as continuous attributes. We used τ = 0.75 and σ_local = K/|D| as the thresholds for p(k | x)
and p(x | k), respectively, where D is the dataset and K is the number of clusters. We also conducted the greedy
pruning.
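To illustrate how the two thresholds act on a candidate label, the following minimal sketch computes p(x | k) and p(k | x) for a conjunction of attribute-value pairs. It assumes a mixture model in which attributes are independent within each cluster and discrete; the parameter layout (`priors`, `cond_probs`), the function names and the use of "at least" in the threshold test are assumptions made for illustration only.

```python
def local_support(label, cond_probs_k):
    """p(x | k) for a conjunction x, given per-attribute value probabilities of cluster k."""
    p = 1.0
    for attr, value in label.items():
        p *= cond_probs_k[attr].get(value, 0.0)
    return p

def membership(label, k, priors, cond_probs):
    """p(k | x) obtained from p(x | k') and p(k') by Bayes' rule."""
    joint = [priors[c] * local_support(label, cond_probs[c])
             for c in range(len(priors))]
    total = sum(joint)
    return joint[k] / total if total > 0 else 0.0

def is_characteristic(label, k, priors, cond_probs, tau, sigma_local):
    """True if the label passes both the p(k | x) and the p(x | k) thresholds."""
    return (membership(label, k, priors, cond_probs) >= tau
            and local_support(label, cond_probs[k]) >= sigma_local)

# Hypothetical two-cluster model with two boolean attributes.
priors = [0.5, 0.5]
cond_probs = [
    {"fin": {"T": 0.9, "F": 0.1}, "milk": {"T": 0.0, "F": 1.0}},
    {"fin": {"T": 0.1, "F": 0.9}, "milk": {"T": 0.8, "F": 0.2}},
]
label = {"fin": "T", "milk": "F"}
```

In this toy model the label (fin=T ∧ milk=F) has p(x | k) = 0.9 for the first cluster and p(k | x) of about 0.978, so it passes τ = 0.75 for that cluster and fails for the other.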

Fig. 2 shows the curve of the Cheeseman-Stutz score with various numbers of clusters, and we have K̂ = 5 as
a peak of this curve. We further continued to compute characteristic labels with the number K of clusters being
around K̂, and found that readable characteristic labels are obtained when K = 6. Table 5 presents these labels.
The shortest characteristic label for the cluster C1 says that the national flags in C1 (and none in the other clusters)
have one saltire (diagonal cross). A typical example of such flags is the Union Jack, and actually many flags in
C1 have one quartered section (i.e. #quarters=1) for the Union Jack. Similarly, the clusters C2 and C3 contain the
flags with vertical bars and with circles, respectively. The label (#saltires=0 ∧ #quarters=1) for C6 distinguishes
C1 and C6, and similarly the labels (#crosses=1 ∧ #saltires=0) and (#crosses=1 ∧ #quarters=0) for C4 jointly
work for distinguishing C4 from C1 and C6, where #crosses indicates the number of upright crosses. Indeed, C6
contains the flag of the United States, and C4 contains the flags of several Scandinavian countries (note that the
Union Jack also contains upright crosses). From the labels for C5, one may see that C5 is a cluster of miscellaneous
flags. On the other hand, when the number K of clusters is set at K = 5, the clusters C2 and C3 are merged into
one cluster, whose characteristic labels are not as intuitive as those in Table 5. These results imply that a plausible
number of clusters can be determined by interactively consulting characteristic labels, with help from model
selection techniques, and clearly exemplify how feedback from the interpretation/evaluation step contributes
to knowledge discovery.



4 Related work 

As mentioned above, there have been only a few labeling approaches. LabelSOM [3] is a labeling method for
self-organizing maps, and Mei et al.'s automatic labeling method for unigram topic models [4] uses a heuristic
score based on pointwise mutual information. As described in Section 2.3, different relevance measures are used

Since each continuous attribute is originally an integer attribute, a proposition "α < Aj < β" (assume here that α and
β are not integers, for simplicity) was translated back into "Aj = ⌈α⌉, ⌈α⌉ + 1, ..., ⌊β⌋" in Table 5. Non-minimal labels
produced by this translation were then removed.
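The translation from an interval back to integer values can be sketched as below. The function names are hypothetical, and, as in the note above, the endpoints are assumed not to be integers.

```python
import math

def interval_to_integers(alpha, beta):
    """Integers v with alpha < v < beta, i.e. ceil(alpha), ..., floor(beta)."""
    return list(range(math.ceil(alpha), math.floor(beta) + 1))

def format_proposition(name, alpha, beta):
    """Render an interval proposition in the 'name=v1,v2,...' style of Table 5."""
    values = interval_to_integers(alpha, beta)
    return f"{name}=" + ",".join(str(v) for v in values)
```

For example, `format_proposition("#circles", 0.6, 2.3)` gives "#circles=1,2", and an interval that contains no integer yields an empty value list.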



Table 4. The confusion matrix, and the characteristic labels for the clusters C1, C2 and C3 in the 20 newsgroups
dataset.

original classes             C1    C2    C3
comp.sys.ibm.pc.hardware      0   907     7
rec.sport.hockey            899    28    10
soc.religion.christian        2   371   575

labels for C1          p(k|x)   p(x|k)
game=T                 0.930    0.552
team=T                 0.976    0.487
hockei=T               0.963    0.381
player=T               0.959    0.319
playoff=T              0.969    0.278
season=T               0.964    0.248
nhl=T                  0.989    0.217
cup=T                  0.927    0.200
score=T                0.936    0.198
leagu=T                0.968    0.174
wing=T                 0.905    0.159
pittsburgh=T           0.956    0.149
toronto=T              0.922    0.145
leaf=T                 0.968    0.137
detroit=T              0.983    0.135
bruin=T                0.990    0.134
penguin=T              0.982    0.134
knock=T                0.916    0.013
year=T ∧ plai=T        0.919    0.147
ca=T ∧ plai=T          0.932    0.125
articl=T ∧ fan=T       0.902    0.121
plai=T ∧ win=T         0.985    0.115

labels for C2          p(k|x)   p(x|k)
card=T                 0.959    0.217
pc=T                   0.972    0.167
mb=T                   0.993    0.132
bu=T                   0.929    0.132
disk=T                 0.969    0.124
window=T               0.958    0.124
instal=T               0.926    0.106
driver=T               0.903    0.094
motherboard=T          0.990    0.092
ibm=T                  0.966    0.092
batteri=T              0.993    0.011
drive=T ∧ work=T       0.903    0.053
drive=T ∧ system=T     0.971    0.049

labels for C3          p(k|x)   p(x|k)
divin=T                0.950    0.099
fals=T                 0.904    0.097
condemn=T              0.904    0.096
reveal=T               0.931    0.094
societi=T              0.921    0.081
kingdom=T              0.920    0.068
guilti=T               0.963    0.049
innoc=T                0.932    0.049
israel=T               0.959    0.043
social=T               0.942    0.030
diseas=T               0.909    0.018
islam=T                0.909    0.018
jehovah=T              0.989    0.015
explor=T               0.986    0.012
christian=T ∧ god=T    0.954    0.437
peopl=T ∧ god=T        0.928    0.435


by Popescul and Ungar [5] and by Lamirel et al. [6] for automatic labeling of document clusters. In these labeling
methods, the length of possible labels seems to be limited in advance, and thus no pruning mechanism, like the
one described in Section 2.4, is given.

CLIQUE [7] is a novel hyper-rectangular clustering method that additionally gives comprehensible descriptions
of the obtained clusters. The description of each cluster is a DNF formula of the ranges of continuous
attributes such as ((30 < age < 50) ∧ (4 < salary < 8)) ∨ ((40 < age < 60) ∧ (2 < salary < 6)). Although CLIQUE
has a similar motivation to ours, it is mainly designed for datasets with continuous attributes. According to the
original description ("Remarks" in Section 2.2 of [7]), if we use discrete attributes, all instances in a cluster must
take the same value for each discrete attribute in a selected subspace. In the proposed method, by contrast, we
do not have such a restriction, and as seen in Section 3.4, we can make use of advanced statistical techniques
such as ones for model selection in the clustering step. The latter point also contrasts the proposed method with
conceptual clustering methods such as COBWEB [19].
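To make the form of such a description concrete, the following minimal sketch (not CLIQUE itself) checks whether an instance satisfies a DNF formula of attribute ranges. The list-of-dictionaries representation and the name `satisfies_dnf` are assumptions for illustration; the ranges are the ones quoted above, taken as strict inequalities.

```python
def satisfies_dnf(instance, dnf):
    """dnf: list of conjunctions, each mapping an attribute to an open range (lo, hi)."""
    return any(
        all(lo < instance[attr] < hi for attr, (lo, hi) in conj.items())
        for conj in dnf
    )

# ((30 < age < 50) AND (4 < salary < 8)) OR ((40 < age < 60) AND (2 < salary < 6))
description = [
    {"age": (30, 50), "salary": (4, 8)},
    {"age": (40, 60), "salary": (2, 6)},
]
```

An instance with age 35 and salary 5 satisfies the first disjunct, whereas one with age 55 and salary 7 satisfies neither.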

In the research on expert systems, it has been a problem to explain the expert system's conclusion to human
users. Wolverton [25] proposed the use of satisficing conclusion-substantiating (SCS) explanations to explain an
expert system's conclusion. Given a system's conclusion c and a threshold p, the SCS explanation e is the shortest
sequence of facts such that p(c | e) > p (or, if there is no such sequence of facts, e = argmax_{e'} p(c | e')). Our search
algorithm would contribute to finding SCS explanations efficiently.

Traditional rule induction methods such as C4.5 and RIPPER [26] can also be applied to find comprehensible
cluster descriptions. However, Hotho et al. reported that these methods tend to produce too many rules for humans
to manage [2]. One possible reason is that C4.5 and RIPPER have a representational limitation that the premises
in the obtained rules are always exclusive and need to be understood fragmentarily. In the proposed method,
on the other hand, each characteristic label is independently interpretable. Another possibility is that C4.5 and
RIPPER try to find the exact boundaries among clusters, by their design. In labeling, however, we do not always
Fig. 2. The Cheeseman-Stutz scores with various numbers of clusters.

Table 5. The characteristic labels for C1, ..., C6 in the flags dataset.

labels for C1                    p(k|x)   p(x|k)
#saltires=1                      1.000    0.900
topleft=white ∧ #quarters=1      0.817    0.622
stripes=0,1,2 ∧ #quarters=1      0.827    0.540
botright=blue ∧ #quarters=1      0.819    0.505
green=T ∧ #crosses=1             0.906    0.467
gold=T ∧ #crosses=1              0.763    0.467
mainhue=blue ∧ #quarters=1       0.810    0.467
#crosses=1 ∧ #quarters=1         0.751    0.420

labels for C2                    p(k|x)   p(x|k)
#bars=1,2,3,4                    0.782    0.800

labels for C3                    p(k|x)   p(x|k)
#circles=1,2 ∧ #crosses=0        0.781    0.540
#circles=1,2 ∧ #quarters=0       0.781    0.540
black=T ∧ #circles=1             0.766    0.225
blue=F ∧ #circles=1              0.765    0.200
botright=green ∧ #circles=1,2    0.781    0.181
topleft=orange ∧ #saltires=0     0.999    0.135
topleft=orange ∧ #crosses=0      0.970    0.135
mainhue=orange ∧ #crosses=0      0.970    0.135

labels for C4                    p(k|x)   p(x|k)
#crosses=1 ∧ #saltires=0         0.810    0.81003
#crosses=1 ∧ #quarters=0         0.829    0.81002
#crosses=1 ∧ #sunstars=0         0.751    0.720
#circles=0 ∧ #crosses=1          0.768    0.640
green=F ∧ #crosses=1             0.757    0.500
#colors=2,3 ∧ #crosses=1         0.759    0.490
gold=F ∧ #crosses=1              0.754    0.356

labels for C5                    p(k|x)   p(x|k)
#bars=0                          0.803    0.900
#circles=0                       0.752    0.900
#crosses=0                       0.755    0.600
#quarters=0                      0.752    0.400
triangle=T                       0.889    0.240
botright=black                   0.888    0.080
mainhue=black                    0.799    0.040

labels for C6                    p(k|x)   p(x|k)
#saltires=0 ∧ #quarters=1        0.960    0.360
topleft=blue ∧ #quarters=1       0.875    0.320


have to find such exact boundaries. Furthermore, traditional rule induction methods often suffer from the so-called
rare-class problem [27] when we have imbalanced or many clusters (if there are many clusters, each cluster is
relatively rare). For example, small groups of instances (small disjuncts) in a rare class are often missed. Actually,
in the zoo dataset, C4.5/RIPPER only generated the rules for the cluster C3 ("birds"): "feathers=True" ⇒ C3
and "feathers=False" ⇒ ¬C3, and the rest of the antecedent patterns we found (Table 1, bottom-right) were
ignored. This is presumably because most of the instances had been covered by the simple rules above in the rule
construction process of C4.5/RIPPER. It is reported that a classifier based on emerging patterns works well for
the rare-class problem [28].

Recently, it was proposed in [15] to unify three similar data mining tasks (contrast set mining, emerging pattern
mining and subgroup discovery) under the name of supervised descriptive rule discovery. Our labeling method
can be seen as a model-based approach in this framework, which focuses on interpretation/evaluation of probabilistic
clusters. In a broader context, for knowledge discovery under an unsupervised setting, a sequential run of
clustering and discriminative labeling would be a promising alternative to frequent pattern mining. Also
recently, Zimmermann and De Raedt introduced a general data mining task called cluster-grouping [29], and a
branch-and-bound algorithm, named CG, for this task. CG efficiently finds characteristic patterns (labels, in our
case) following a guide from a convex relevance score such as information gain (used in ID3), WRAcc (Section 2.3)
and category utility (used in COBWEB). Although this algorithm is powerful, it could not be directly
applied to our labeling problem, since the membership probability p(k | x) does not seem to be convex.



In the context of probabilistic modeling, the proposed method with mixture models could be extended for
evidence-based sensitivity analysis (e.g. [30]) or explanatory analysis (e.g. [31]) of Bayesian networks, in which
the membership probability p(k | x) is generalized as p(q | e), where q is an instantiation of a query variable and e
is an instantiation of (a part of) the evidence variables, and thus we search for a minimal combination e of evidence
that is highly influential on the observation q. To the best of our knowledge, the most recent and closest work
is Yuan et al.'s general framework for most relevant explanation (MRE) [32, 33]. Their MRE framework adopts a
relevance score called the generalized Bayes factor (GBF), defined as GBF_k(x) = p(k | x)/p(k | ¬x) in our labeling
problem. The MRE framework looks attractive, but seems unfit for our case for several reasons. First, for the
k-th cluster, a ranking over the propositional labels x by GBF_k(x) is different from the one used in the clustering
step (i.e. by p(k | x)). Second, GBF_k(x) = (p(x | k)/(1 − p(x | k))) · ((1 − p(x))/p(x)) is numerically unstable when
p(x | k) ≈ 1. For instance, we cannot order the labels x such that p(x | k) = 1, which in fact appear in one of our
experiments (i.e. Table 1). Third, the MRE framework only provides an MCMC-based approximate method or an exact (exhaustive)
method without safe pruning (like the one based on global/local support and minimality in the proposed method)
for finding relevant x. Lastly, the MRE papers do not describe how to handle continuous attributes.
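The closed form of the generalized Bayes factor used in the second point follows from Bayes' rule. A short derivation, using only the quantities already introduced and reading the definition as the ratio of p(k | x) to p(k | ¬x) (an interpretation of the definition given above), is:

```latex
\begin{align*}
\mathrm{GBF}_k(x)
  &= \frac{p(k \mid x)}{p(k \mid \neg x)}
   = \frac{p(x \mid k)\,p(k)/p(x)}{p(\neg x \mid k)\,p(k)/p(\neg x)} \\
  &= \frac{p(x \mid k)}{1 - p(x \mid k)} \cdot \frac{1 - p(x)}{p(x)} .
\end{align*}
```

The factor 1 − p(x | k) in the denominator vanishes as p(x | k) approaches 1, so every label with p(x | k) = 1 receives an unbounded score and such labels cannot be ranked against each other.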

Handling continuous attributes is an important issue in CAR (class association rule) mining. For example,
Washio et al. [34] proposed a CAR mining method that discretizes the continuous space on the fly with hyper-rectangular
clustering. The difference from our labeling method is that we are given probabilistic clusters from the
beginning and thus we effectively limit propositions to the ones of the form "α < Aj < β", where α and β are
symmetric w.r.t. the mean in the cluster. Besides, as in usual CAR mining, Washio et al.'s method searches for the
antecedent patterns x based on the local support p(x | k).

Section 2.2 described that the EM algorithm is adopted for clustering. We can also use the K-means algorithm
instead, since K-means can be seen as an instance of a parameter estimation framework often called Viterbi
training, tailored for a Gaussian mixture model with equal class probabilities and a common covariance matrix
of the form σ²I [35]. Once the model parameters have been estimated, our labeling method is applicable as written
in this paper. Similarly to the case in Section 3.2, when combined with K-means, the Euclidean distance from the
centroid (the mean in the cluster) is translated into a cumulative probability under a Gaussian distribution.
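One way to read the last sentence, for a single continuous attribute, is sketched below: a distance d from the cluster mean is mapped to the probability mass of the symmetric interval (mean − d, mean + d) under a Gaussian with standard deviation σ. The function name is hypothetical and the one-dimensional treatment is an assumption; the paper's exact procedure is the one given in its Section 3.2.

```python
import math

def symmetric_interval_probability(distance, sigma):
    """P(mean - distance < A < mean + distance) for A ~ N(mean, sigma^2).

    The result does not depend on the mean, because the interval is
    symmetric with respect to it.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return math.erf(distance / (sigma * math.sqrt(2.0)))
```

For instance, a distance of one standard deviation covers about 68% of the cluster's mass and a distance of 1.96 standard deviations covers about 95%.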

5 Conclusion and future work 

In this paper, we proposed a new labeling method that associates propositional labels (conjunctions of attribute-value
pairs) with the clusters obtained by mixture models, to help us interpret or evaluate the clusters. As shown
in the experimental results, the proposed method finds a set of intuitive descriptive labels that characterize well
or "verbalize" the clusters. The proposed method is fully applicable to various datasets including continuous attributes
and missing values, and can be a new, in-depth and consistent tool for cluster interpretation/evaluation.
Besides, the experimental results also show that feedback from the interpretation/evaluation step can play an
important role in achieving a reasonable clustering result. In future work, we would like to extend the proposed
method to use disjunctive formulas or a richer representation. For example, we may merge two similar characteristic
labels (milk=T ∧ legs=4) and (hair=T ∧ legs=4) into ((milk=T ∨ hair=T) ∧ legs=4) to gain a higher
local support. In a purely logical sense, our labeling algorithm can be formulated under the setting of inductive
logic programming (ILP) with a simple refinement operator. As in ILP, the use of background knowledge such as
a taxonomy seems helpful for obtaining more comprehensible descriptions.
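The gain in local support from such a merge can be illustrated on a toy example. The four instances below are hypothetical (they are not taken from the zoo dataset), and the local support is estimated here as a simple fraction of the instances in a hard cluster, which is an assumption made for brevity rather than the model-based computation used in the paper.

```python
def empirical_local_support(instances, predicate):
    """Fraction of the cluster's instances that satisfy the label."""
    return sum(1 for inst in instances if predicate(inst)) / len(instances)

# Hypothetical members of one cluster.
cluster = [
    {"milk": True, "hair": True, "legs": 4},
    {"milk": True, "hair": False, "legs": 4},
    {"milk": False, "hair": True, "legs": 4},
    {"milk": True, "hair": True, "legs": 2},
]

def milk_label(inst):
    return inst["milk"] and inst["legs"] == 4

def hair_label(inst):
    return inst["hair"] and inst["legs"] == 4

def merged_label(inst):
    # (milk=T OR hair=T) AND legs=4
    return (inst["milk"] or inst["hair"]) and inst["legs"] == 4

supports = {
    "milk": empirical_local_support(cluster, milk_label),
    "hair": empirical_local_support(cluster, hair_label),
    "merged": empirical_local_support(cluster, merged_label),
}
```

Here each conjunctive label covers half of the cluster while the merged label covers three quarters of it. A merged label always covers at least as many instances as either of its parts, so its local support can never be lower.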